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Abstract. Motivated by the problem of storing coloured de Bruijn graphs, 
we show how, if we can already support fast select queries on one string, 
then we can store a little extra information and support fairly fast select 
queries on a similar string. 


1 Introduction 

Many compressed data structures for strings rely on three fundamental queries: 
access, rank and select. The query S.access(i) on a string S returns its ith char¬ 
acter; the query 5'.rank a (i) returns the number of occurrences of character a in 
the prefix of S of length i: and the query .Slselectafj) returns the position of the 
jth leftmost occurrence of a in S. Suppose we have a data structure supporting 
these queries on a string Si and we want another data structure supporting them 
on a similar string S- 2 - It is not difficult to store 0(d) extra words, where d is the 
standard edit distance between S\ and S 2 (he., the number of single-character 
insertions, deletions and substitutions needed to change one into the other), and 
support access to any character of S 2 using 0(loglog(|Si| + IS 2 I)) time on top 
of an access query on Si. Last year, when describing their relative FM-index 
data structure, Belazzougui et al. If] showed how to store O(d) extra words and 
support any rank query on S 2 using d(loglog(|5i| + IS 2 I)) time on top of a rank 
query on Si. In this paper we show how to store O(d) extra words and support 
any select query on S 2 using O(loglog(|5i| + |^ 2 1)) time on top of a select query 
on Si. We call this relative select and we expect it to be useful when storing 
compressed data structures for navigating in coloured de Bruijn graphs [9]. 

Belazzougui el al. were interested in saving space when storing FM-indexes [6] 
for many genomes from the same species. An FM-index for a genome is essen¬ 
tially just a data structure supporting access and rank on the Burrows-Wheeler 
Transform [5] (BWT) of that genome. The BWT sorts the characters of a string 
into the lexicographic order of the suffixes that immediately follow them. The 
edit distance between two genomes from the same species tends to be small rel¬ 
ative to their lengths and in practice the edit distance between their BWTs also 
tends to be small. Therefore, if we store the FM-index for one genome normally, 
we can use Belazzougui et al.’s result to save space when storing FM-indexes 
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for other genomes from the same species (at the cost of increasing their query 
times). 

It is possible to support nearly all the functionality of an FM-index without 
using select queries on the underlying BWT, so Belazzougui et al. did not con¬ 
sider relative select. Adding it to their data structure allows us, e.g., to extract 
more quickly the characters following occurrences of a pattern. Our interest in 
relative select, however, comes from Bowe et al.’s [ 4 ] (see also [ 3 ]) compressed 
representation of de Bruijn graphs — which is based on something like an FM- 
index and uses select queries to find nodes’ predecessors, and which we call the 
BOSS representation for the authors’ initials — and the possibility of extending 
it to coloured de Bruijn graphs. Our plan for future work is to view a coloured 
de Bruijn graph as a union of normal de Bruijn graphs, and relatively compress 
the BOSS representations of those graphs. Due to space constraints, we provide 
a brief summary of the BOSS representation and coloured de Bruijn graphs as 
an appendix. In Section [ 2 ] we describe how we implement relative select, and in 
Section [ 3 ] we give experimental evidence that our implementation is practical. 
For simplicity, we assume throughout that the size of the alphabet is constant, 
and we work in the word-RAM model with ,! 2 (log(|Si| -f |/S > 2|))-bit words. 

2 Design 

Although our implementation of relative select is made up of steps that are 
individually very simple, the overall effect might be confusing. To mitigate this, 
we break our presentation into pieces: first, we consider the case when S2 is a 
subsequence of Si; then, we consider the case when S2 is a supersequence of Si; 
and finally, we combine our solutions for these special cases to obtain a general 
solution. We close this section with a small example. 

Lemma 1 . Given a select data structure for a string Si, and a subsequence S 2 
of Si, we can store 0 (|Si| — IS2I) extra words and support any select query on 
S2 using O (loglog |Si|) time on top of a select query on Si. 

Proof. We store a bitvector £?[l..|Si|] with Is marking the characters of Si 
that do not appear in S2. For each distinct character x, we store a bitvector 
B x [l..occ(x, Si)], where occ(x, Si) is the number of occurrences of x in Si, with 
Is marking the occurrences of x in Si that do not appear in S2. This takes a 
total of C>(| Si | — IS2I) extra words and lets us compute 

S2.select3,(z) = R.ranko(Si.select a ,(S s .selecto(i))) 

using O (loglog |Si|) time on top of a select query on Si. To see why this equal¬ 
ity holds, consider that .B^.selecto^) returns the rank in Si of the ith x that 
appears in S2; Si.select x (.B x .selecto(i)) returns the position of that x in Si; and 
R.rank 0 (Si.select x (.B x .selecto(z))) returns the position of that x in S2. □ 

Lemma 2 . Given a select data structure for a string Si, and a supersequence 
S2 of S\, we can store <0 (| S2 1 — |Si|) extra words and support any select query 
on S2 using 0 (loglog |S2 1 ) time on top of a select query on Si. 
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Proof. We store a bitvector S[1..|S2|] with Is marking the characters of S2 that 
do not appear in Si, and a select data structure for the subsequence D of S 2 
consisting of those marked characters. For each distinct character x , we store a 
bitvector B x [l..occ(x, £2)] with Is marking the occurrences of x in S2 that do 
not appear in Si. This takes a total of 0(|S2 — |Si|) extra words and lets us 
compute 


q I , _ / B.se\ect 0 (Si.se\ect x (B x .rank 0 (i))) if B x [i] = 0, 

2 'SeeCxbJ | S.selecti(-D.selectr(.B x .ranki(i))) if B x [i\ = 1. 

using O (loglog IS2Q time on top of a select query on Si. To see why this equal¬ 
ity holds, suppose the ith x in S2 also appears in Si, so B x [i\ = 0. Consider 
that S a ,.ranko(i) returns the rank of that x in Si; Si.selects (LL,.ranko(i)) re¬ 
turns the position of that x in Si; and S.select 0 (Si.select 2 ,(i? x .ranko(i))) re¬ 
turns the position of that x in S2. Now suppose the *th x in S2 does not 
appear in Si, so B x [i\ = 1. Consider that B x .rank\(i) returns the rank of 
that x in D ; I?.select x (i3 a ;.ranki(*)) returns the position of that x in D\ and 
S.selecti(Z3.select a; (i3 z ,.ranki(i))) returns the position of that x in S2. □ 

Theorem 1. Given a select data structure for a string Si, and another string 
S2, we can store 0 (d) extra words, where d is the edit distance between Si and 
S2, and support any select query on S2 using ©(log log(|Si | + IS2I)) time on top 
of a select query on Si. 

Proof. Consider a sequence of d single-character insertions, deletions and sub¬ 
stitutions that turns Si into S2. Let C be the common subsequence of Si and 
S2 consisting of characters left unchanged by these d edits (or a longer common 
subsequence if we can find one). By Lemma [l] we can store ©(d) extra words 
and support any select query on C using O (log log |Si|) time on top of a select 
query on Si. By Lemma [2j we can then store ©(d) extra words and support 
any select query on S2 using ©(loglog IS2I) time on top of a select query on C. 
Therefore, we can store ©(d) extra words on top of the select data structure for 
Si and support any select query on S2 using ©(loglog(|Si| + IS2I)) time on top 
of a select query on Si. □ 

For example, consider the strings Si = TCTGCGTAAAAGGTGC and S2 = 
TGCTCGTAAAACGCG (the BWTs of GCACTTAGAGGTCAGT and GCACTA- 
GACGTCAGT, respectively, from the running example in Belazzougui et al.’s 
paper). Their edit distance is 5 and their longest common subsequence is C = 
TCTCGTAAAAGG. If we already have a select data structure for Si and we 
want one for S 2 , we first add support for relative select on C by the bitvectors 
B,B^, ... ,Bj, shown below; then we add support for relative select on S2 by 
storing bitvectors B ', B A ,..., Bf, also shown below, and a select data structure 
for D = GCC. We note that if we have a relative FM-index for S2 with respect 
to Si, then it already includes B, B' and D. 
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R[1..16] = 0001000000010101 


B'[ 1..15] = 010000000001010 


B a [1.A] = 0000 
B c [l-3] = 001 
B G [1..5] = 10100 
S T [1-4] = 0001 


B' a [1.A] = 0000 
B' c [l.A] = 0011 
B' g [1.A] = 1000 
[1..3] = 000 



'g 


To compute 5 , 2-selectc(4), for instance, we check B' c [ 4] and see it is 1, meaning 
the fourth C in S 2 does not appear in C. Since R(-.ranki(4) = 2, it is the second 
C in D. Since Z?.selectc(2) = 3, it is the third character in D. Finally, since 
i?[.selecti(3) = 14, it is the 14th character in S 2 , meaning S < 2.selectc(4) = 14. 

To compute S 2 -selected), we check B' G [3] and see it is 0, meaning the third 
G in S 2 also appears in C. Since _Bg.ranko(3) = 2, it is the second G in C. Since 


C.selectG(2) = B.rank 0 (S'i.selectG(BG-selecto(2))) = 11, 


it is the 11th character in C. Finally, since F>[.select 0 (ll) = 13, it is the 13th 
character in S 2 , meaning selected) = 13. 

3 Experiments 

We augmented the existing implementation of the Relative FM-index with our 
new select structure. The implementation is written in C++ and based on the 
Succinct Data Structures Library 2.0 [3- We used g++ version 4.8.1 to compile 
the code. Our experiments were run in a computer cluster with two 16-core 
AMD Opteron 6378 processors in each node. The nodes were running Linux 
kernel 2.6.32. Query tests were run on a single core in a dedicated node with no 
other load. 

As our reference sequence, we chose the 1000 Genomes Project’s version of 
the GRCh37 human reference genome, both with (3.096 Gbp) and without (3.036 
Gbp) chromosome Y. For a target sequence, we chose the maternal haplotype 
of the 1000 Genomes Project’s individual NA12878 (3.036 Gbp) [12] ■ We built 
a plain FM-index for the reference sequences and the target sequence, as well 
as relative FM-indexes for the target sequence relative to both references and 
with and without structures for relative select; the lengths of the common subse¬ 
quences used were 2.992 Gbp and 2.991 Gbp, respectively. In all cases, we used 
plain bitvectors in the wavelet trees and entropy-compressed bitvectors m for 
marking the common subsequences. 

To test the performance of relative select, we ran 100 million random U^(i) = 
BWT. select c (i — C[c]) queries on the BWT of the target sequence, using a plain 
FM-index and Relative FM-indexes with and without relative select. (Char¬ 
acter c is the <th character in the BWT in sorted order, while C[c] is the 
number of occurrences of characters smaller than c in the BWT.) The im¬ 
plementation of in the Relative FM-index without relative select was based 
on binary searching with rank queries. As a comparison, we also ran LF(i) = 
C[BWT[?']] + BWT.rank B wT[i](*) queries. Table [l] shows the results: the relative 
FM-indexes without relative select are each about a fifth the size of the normal 
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Table 1 . Average query times for 100 million random LF and W queries on NA12878 
stored relative to the human reference genome, with and without chromosome Y. 



FM-index 


Relative FM-index 

+ Relative Select 

ChrY 

space 

LF 

M/ 

space LF t|/ 

total space 

4/ 

yes 

1090 MB 

0.55 ps 

1.22 ps 

218 MB 3.95 ps 48.0 ps 

382 MB 

6.11 ps 

no 

1090 MB 

0.55 ps 

1.11 ps 

181 MB 3.84 ps 44.8 ps 

331 MB 

6.12ps 


FM-indexes but rank queries are about seven times slower and select queries are 
about forty times slower; the relative FM-indexes with relative select are about 
a third the size of the normal FM-indexes but select queries are only about five 
times slower (rank queries are unaffected). 
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A de Bruijn Graphs 

In biology, the (edge-centric) kth-order de Bruijn graph for a set of strings (e.g., 
DNA reads) is the graph whose nodes are those strings’ fc-mers (substrings of 
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Fig. 1 . Bowe et al.’s augmented de Bruijn graph (left) and matrix (right) for the string 
TACGTCGACGACT; the last column TCCGTGGATAA$C is like a BWT of the edges. 


length At) , with a directed edge (u,v) from u to v if at least one of the strings 
contains a corresponding substring of length k + 1 with u as a prefix and v as 
a suffix. We label (u, v) with the last character of v. Almost all state-of-the-art 
DNA assemblers build contigs via Eulerian assembly Effio] on de Bruijn graphs, 
making their space- and time-efficient representation an important problem in 
bioinformatics. 

Bowe et al. add certain dummy nodes and edges, sort the edges into the right- 
to-left lexicographic order of the nodes they leave, and take the last column of 
the matrix whose rows are the edges in sorted order (or, equivalently, take the 
last character in each edge). The result is like a BWT in which edges correspond 
to characters and nodes correspond to the substrings containing all their out- 
edges’ characters. For example, for the string TACGTCGACGACT and k = 3, 
Bowe et al. add nodes $$$, $$T and $TA and edges S$$T, $$TA and STAC to 
obtain the graph shown on the right side of Figure [T] build the matrix shown 
on the left side of the figure; and take the last column TCCGTGGATAASC. (This 
example is from [3].) With some auxiliary data structures, we can use rank and 
select queries on this edge-BWT to navigate forward and backward in the graph. 

For the two strings TACGTCGACGACT and TACGACGCGACT and k = 3, 
the de Bruijn graph is 2 nodes larger than the graphs for strings separately. 
If we store whether each edge occurs in the first string, the second string, or 
both, then the result is a coloured de Bruijn graph. Coloured de Bruijn graphs 
were introduced by Iqbal et al. [9j for detecting variations between individuals’ 
genomes, and are now also used in other areas of genomics (see, e.g., [2]). We can 
view the coloured de Bruijn graph as the union of each graph consisting of edges 
of the same colour. In a future paper we will show how to combine the BOSS 
representations of the individual de Bruijn graphs to obtain a representation of 
the coloured de Bruijn graph, and also how to relatively compress the auxiliary 
data structures for the BOSS representations of the individual graphs. 

We can use Belazzougui et al.’s result to relatively compress the edge-BWTs 
of the individual graphs while still supporting rank over them. For example, 
the edge-BWTs for TACGTCGACGACT and TACGACGCGACT with k = 3 are 
TCCGTGGATAASC and TCCGTGGACAAS, respectively. They are so close — edit 
distance 2 - because most of the strings’ 4-tuples are common to both and, 
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thus, most of their de Bruijn graphs’ edges are common to both. We note that, 
for reasonable values of k, most of the (fc + l)-mers in genomes from the same 
species should also be common to most of the genomes. In this paper we showed 
how to support relative select on similar strings, which we will eventually need 
to navigate backward across edges in our representation of coloured de Bruijn 
graphs. 



